Linear Regression Notes

These notes rely on a variety of sources:

A Motivating Example

Necessary thought exercise #1

Why is this exercise necessary?

In Machine Learning, we ask the question of "how would you figure this out?" via a three-step process:

  1. We define a model class (or synonymously, a hypothesis class). We will pick one function from that class to be our predictive model.
  2. We define a loss function. This defines the "best" model, i.e. the one we're going to pick.
  3. We define an optimization algorithm. This is how we actually find the best model.
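To make the three steps concrete, here is a minimal sketch on made-up toy data: the model class is the set of functions $h(x) = wx$, the loss is mean squared error on the training data, and the "optimization algorithm" is a deliberately naive brute-force search over candidate values of $w$.

```python
import numpy as np

# A minimal sketch of the three-step recipe on toy data (all values here
# are made up for illustration).
# 1. Model class: all functions of the form h(x) = w * x for a scalar w.
# 2. Loss: mean squared error on the training data.
# 3. Optimizer: brute-force search over a grid of candidate w values.

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + rng.normal(0, 1, size=50)   # data roughly follows h(x) = 3x

def loss(w):
    """MSE of the hypothesis h(x) = w * x on the training data."""
    return np.mean((y - w * x) ** 2)

candidates = np.linspace(-10, 10, 2001)   # a (discretized) model class
best_w = candidates[np.argmin([loss(w) for w in candidates])]
print(best_w)  # close to 3.0
```

Real optimizers are much smarter than a grid search, but the structure is the same: search $\mathcal{H}$ for the hypothesis with the lowest loss.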

Digging deeper... Model Class

A model class, or hypothesis class, sometimes represented as $\mathcal{H}$, is the set of all functions that we could possibly learn and from which we will select our final model.

Some desirable properties of a model class:

Digging deeper... Loss function

From Weinberger's ML course:

We try to find a function $h$ within the hypothesis class that makes the fewest mistakes within our training data... How can we find the best function? For this we need some way to evaluate what it means for one function to be better than another. This is where the loss function (aka risk function) comes in. A loss function evaluates a hypothesis $h \in \mathcal{H}$ on our training data and tells us how bad it is. The higher the loss, the worse it is - a loss of zero means it makes perfect predictions.

Let's call our loss function (for a given hypothesis) $\mathcal{L}(h)$.

Digging deeper... Optimization Algorithm

Our goal now is just to find $h^* \in \mathcal{H}$ such that:

$$h^* = \mathop{\mathrm{argmin}}_{h \in \mathcal{H}} \mathcal{L}(h)$$

Motivating Linear Regression

One way to motivate/understand (ordinary least squares) linear regression is that it defines a specific model class, with a specific loss function, and has a closed-form approach to optimization. However, I'm going to take a slightly different motivating approach first, and then circle back to this.

Necessary thought exercise #2

OK, now... what if you could only make 1 guess for both of them?

The "Predict Only One" Best Answer

This section largely follows Shalizi, Chapter 1

As we saw above, to say one model is "better" than another, we need some notion of "better". For reasons that will become clear as we move through this course, a typical notion of "better" is the mean squared error (MSE). For a true value $y$ and a predicted value $\hat{y}$, the MSE of the prediction is: $$MSE(\hat{y}) = (y-\hat{y})^2$$

Exercises:

Now, let's assume we're trying to predict the value of some random variable $Y$, and our goal is to make the best single prediction for it. Let's use MSE as our error metric. Because $Y$ is a random variable, we cannot specify an exact value for this MSE. Instead, we have to think about the expectation of that MSE. Let's keep $\hat{y}$ as our prediction. Then our goal is to find the best possible value of $\hat{y}$; let's call that best possible value $\mu$. Put formally, we want to solve the following optimization problem:

$$\mu = \mathop{\mathrm{argmin}}_{\hat{y} \in \mathcal{R}} MSE(\hat{y}) = \mathop{\mathrm{argmin}}_{\hat{y} \in \mathcal{R}} \mathbb{E}\left[(Y - \hat{y} )^2\right]$$

With some fancy manipulations (see Shalizi), we get that

$$MSE(\hat{y}) = (\mathbb{E}[Y] - \hat{y})^2 + \mathrm{Var}[Y]$$
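For those curious, the "fancy manipulation" is just adding and subtracting $\mathbb{E}[Y]$ inside the square and expanding:

$$\mathbb{E}\left[(Y - \hat{y})^2\right] = \mathbb{E}\left[\left((Y - \mathbb{E}[Y]) + (\mathbb{E}[Y] - \hat{y})\right)^2\right] = \mathrm{Var}[Y] + 2(\mathbb{E}[Y] - \hat{y})\,\mathbb{E}\left[Y - \mathbb{E}[Y]\right] + (\mathbb{E}[Y] - \hat{y})^2$$

The cross term vanishes because $\mathbb{E}[Y - \mathbb{E}[Y]] = 0$, leaving the two terms above.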

Now, we want to find the optimal value for $\hat{y}$. Notice that the second term doesn't depend on our prediction, so we don't need to worry about it during optimization. So now, we want to find:

$$ \mu = \mathop{\mathrm{argmin}}_{\hat{y} \in \mathcal{R}} \, (\mathbb{E}[Y] - \hat{y})^2$$

The solution to this is $\mu = \mathbb{E}[Y]$; that is, the best prediction to make is the expected value of the random variable!
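We can sanity-check this numerically (an illustration, not a proof): draw a large sample from some distribution, sweep over candidate guesses, and see which one minimizes the average squared error.

```python
import numpy as np

# A quick numerical check: for samples from a random variable Y, the average
# squared error is smallest when the guess yhat equals the mean of Y.
# The distribution here (Normal(5, 2)) is an arbitrary choice.

rng = np.random.default_rng(1)
y = rng.normal(loc=5.0, scale=2.0, size=100_000)

def avg_squared_error(yhat):
    return np.mean((y - yhat) ** 2)

guesses = np.linspace(0, 10, 1001)
best = guesses[np.argmin([avg_squared_error(g) for g in guesses])]
print(best, y.mean())  # both close to 5.0
```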

Exercises

"Predict Only One" When you only have a sample

The above tells us what to do when we know $\mathbb{E}[Y]$. Sadly, to know that, we would need all of the data from the population. What to do? Well, our friend the law of large numbers tells us that if we have iid samples $y_1,...,y_n$ from the population, then our sample mean---let's call that $\hat{\mu}$---converges to $\mathbb{E}[Y]$! So even though we don't know $\mathbb{E}[Y]$, we have something that converges to it as $n$ increases.

So, our "predict only one" optimal answer is the sample mean. Let's see what that looks like with our data from above!

How good of a guess is that?

Exercise (don't peek!): How would you evaluate how good of a guess this is for the two missing data points?

We'll return to how I would do it in a bit.

The Regression Function

In the meantime, let's look at the more interesting case. To this point, we've totally ignored the information that we have on the x-axis; i.e., what attempt number it is for me. Clearly, there is some valuable information there!

Put another way, if instead of just making a prediction based on our samples $y_1,...y_n$, we also used information from our $x_1,...,x_n$s, we should be able to make better predictions.

Put yet another way, then, we want our predictions to be a function of the $X$s! Let's call that function $f(X)$.

Now we can again ask, what should the function be to minimize our MSE? In other words, what should we set $f$ as to minimize $MSE(f)$?

In the notation from above, we want to solve:

$$\mu(x) = \mathop{\mathrm{argmin}}_{f} MSE(f(X)) = \mathop{\mathrm{argmin}}_{f} \mathbb{E}\left[\left(Y - f(X)\right)^2\right]$$

With similar math as above (see Shalizi, 1.12-1.15 if you want to do it out yourself), we get that:

$$ \mu(x) = \mathbb{E}[Y | X = x] $$

That is, the optimal prediction in terms of MSE for $Y$ when we have information about $X$ is to take the conditional expected value at that value of $x$. The equation above is called the regression function.
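A small simulation (with made-up data) makes the payoff concrete: when $Y$ depends on $X$, predicting with the conditional mean $\mathbb{E}[Y \mid X = x]$ yields a much lower MSE than predicting with the single best constant $\mathbb{E}[Y]$.

```python
import numpy as np

# Illustrative simulation: Y depends on a discrete X, so the conditional
# mean beats the unconditional mean as a predictor.

rng = np.random.default_rng(2)
x = rng.integers(0, 5, size=200_000)          # X takes values 0..4
y = 2.0 * x + rng.normal(0, 1, size=x.size)   # Y depends on X, plus noise

# "Predict only one": the unconditional mean of Y.
mse_constant = np.mean((y - y.mean()) ** 2)

# The regression function: the conditional mean at each value of x.
cond_mean = np.array([y[x == v].mean() for v in range(5)])
mse_regression = np.mean((y - cond_mean[x]) ** 2)

print(mse_constant, mse_regression)  # the second is much smaller
```

The remaining MSE of the conditional-mean predictor is just the noise variance, which no predictor based on $X$ can remove.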

Now what?

Let's come back to our machine learning perspective, now. When we say we want to define a model/hypothesis class, we can now say precisely what we are doing:

What model class might we choose that optimizes for our goals above---parsimony, efficient training, efficient prediction, and model expressivity? The most common answer to this question for a very long time in both statistics and machine learning has been a linear model. In a linear model, we choose to approximate $\mu(x)$ with a function that is linear in the model parameters.

In the case where $x$ is a single variable, then, we approximate $\mu(x)$ with the function $b_0 + b_1x$.
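As a sketch of what fitting $b_0 + b_1 x$ looks like in practice, here is the textbook closed-form OLS solution for a single feature, on made-up data (the true coefficients 1.5 and 0.8 are arbitrary):

```python
import numpy as np

# Closed-form OLS for the one-variable linear model b0 + b1 * x:
#   b1 = cov(x, y) / var(x),   b0 = mean(y) - b1 * mean(x)
# Toy data invented for illustration.

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 1.5 + 0.8 * x + rng.normal(0, 0.5, size=x.size)

b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # close to 1.5 and 0.8
```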

Exercise: Why don't we just use $b_1x$ as the approximation?

Returning to our simple example

Neat! Let's run a linear regression

Making and evaluating predictions

Uh. Not so good. What if we make the task a little less trivial?

A pre-Real world Real world example

California Housing Data

https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset

Fitting a linear regression model on the California housing data set

Which features to use?

Generally, you might be tempted to use features that are strongly correlated with the target (correlation closer to +1 or -1), e.g., MedInc. For linear regression, that intuition often holds; however, linear correlation is not the only indicator. You could have combinations of seemingly uncorrelated features that together yield a good prediction for the target.
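To illustrate that last point with synthetic (made-up) data: two features that each have only a weak marginal correlation with the target can, in combination, predict it almost perfectly.

```python
import numpy as np

# x1 and x2 share a large common component, and the target is their
# difference, so each feature alone barely correlates with y --
# yet together they determine y exactly.

rng = np.random.default_rng(4)
shared = rng.normal(size=10_000)
x1 = shared + rng.normal(scale=0.1, size=10_000)
x2 = shared + rng.normal(scale=0.1, size=10_000)
y = x1 - x2

print(np.corrcoef(x1, y)[0, 1])   # weak marginal correlation
print(np.corrcoef(x2, y)[0, 1])

# A linear model using both features (plus an intercept column) fits y
# almost perfectly:
X = np.column_stack([x1, x2, np.ones_like(x1)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
print(np.mean(resid ** 2))        # essentially zero
```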

Importance of intercept

What happens if our model did not have an intercept, i.e., we are forcing the hyperplane to go through the origin?

We are going to check if having an intercept impacts the accuracy of the model. This is where we introduce the notion of training error vs. generalization error.
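Here is a sketch of that comparison on made-up data whose target has a large offset, so forcing the hyperplane through the origin hurts. The errors on the held-out test set are the generalization-error estimates:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with a big intercept (10.0), invented for illustration.
rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(500, 3))
y = 10.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

results = {}
for fit_intercept in (True, False):
    model = LinearRegression(fit_intercept=fit_intercept)
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    results[fit_intercept] = (train_err, test_err)
    print(fit_intercept, train_err, test_err)
# With the intercept, both errors are near the noise variance;
# without it, both are far larger.
```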

Having no intercept makes the model less accurate

Side point ... what's the deal with the whole train_test_split thing? We'll return to this later...

OK!

So we've seen the basics of linear regression. There are, however, a few things that we have yet to do: